- This is the on-line help file for CLUSTAL V.
-
- It should be named or defined as: clustalv_hlp
-
- >>HELP<< 1 General help for CLUSTAL V
- CLUSTAL V is a general purpose multiple alignment program for DNA or proteins.
-
- SEQUENCE INPUT: all sequences must be in 1 file, one after another. 3 formats
- are automatically recognised: NBRF/PIR, EMBL/SWISSPROT or Pearson (Fasta).
- All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
- except "-" which is used to indicate a GAP. Upper or lower case is allowed.
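The input rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (letters and "-" are kept, everything else is dropped); the function name `clean_sequence` is hypothetical, and folding the result to upper case is an assumption made for convenience, not something the program is documented to do.

```python
def clean_sequence(raw: str) -> str:
    """Keep ASCII letters and the gap character '-'; drop everything else.

    Upper-casing is an assumption for readability; the help text only says
    that either case is accepted on input.
    """
    kept = [ch for ch in raw if ch == "-" or (ch.isascii() and ch.isalpha())]
    return "".join(kept).upper()


if __name__ == "__main__":
    # Digits, spaces and punctuation vanish; the gap survives.
    print(clean_sequence("1 acgt- 12 GG.a"))
```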
-
-
- To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
- INPUT them; go to menu item 2 to do the multiple alignment.
-
-
- PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to
- add a new sequence to an old alignment. GAPS in the old alignments are
- indicated using the "-" character. PROFILES can be input as PIR format files.
-
-
- PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
- as PIR format files, with "-" characters to indicate gaps) OR after a multiple
- alignment while the alignment is still in memory.
- >>HELP<< 2 Help for multiple alignments
-
- If you have already loaded sequences, use menu item 1 to do the complete
- multiple alignment. You will be prompted for 2 output files: 1 for the
- alignment itself; another to store a dendrogram that describes the similarity
- of the sequences to each other.
-
- Multiple alignments are carried out in 3 stages (automatically done from menu
- item 1 ... multiple alignments NOW):
-
- 1) all sequences are compared to each other (pairwise alignments);
-
- 2) a dendrogram (like a phylogenetic tree) is constructed, describing the
- approximate groupings of the sequences by similarity (stored in a file).
-
- 3) the final multiple alignment is carried out, using the dendrogram as a guide.
-
-
- PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial
- alignments.
-
- MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
-
-
-
-
- You can skip the first stages (pairwise alignments; dendrogram) by using an
- old dendrogram file (menu item 3); or you can just produce the dendrogram
- with no final multiple alignment (menu item 2).
-
-
- OUTPUT FORMAT: Menu item 6 (format options) allows you to choose between 4
- different alignment formats (CLUSTAL, GCG, NBRF/PIR and PHYLIP).
-
-
- >>HELP<< 3 Help for pairwise alignment parameters
-
- A similarity score is calculated between every pair of sequences, and these
- scores are used to construct the dendrogram which guides the final multiple
- alignment.
-
- These similarity scores are calculated from fast, approximate, global
- alignments, which are controlled by 4 parameters. 2 techniques are used to make
- these alignments very fast: 1) only exactly matching fragments (k-tuples) are
- considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
- are used.
-
-
- K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
- INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
- For longer sequences (e.g. > 300 residues) you may need to increase the default.
-
-
- GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
- little effect on the speed or sensitivity.
-
-
-
-
-
-
- TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
- dot-matrix plot) is calculated. Only the best ones (with most matches) are
- used in the alignment. This parameter specifies how many. Decrease for speed;
- increase for sensitivity.
-
-
- DIAGONAL WINDOW: This is the number of diagonals around each of the 'best'
- diagonals that will be used. Decrease for speed; increase for sensitivity.
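The k-tuple, top-diagonal and window ideas described above can be sketched as follows. This is a simplified illustration, not the program's actual code: the function names are hypothetical, a diagonal is identified by the offset i - j between positions in the two sequences, and the parameter values in the usage lines are arbitrary rather than the program's defaults.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(seq_a, seq_b, k):
    """Count exact k-tuple matches on each diagonal (offset i - j)."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def select_diagonals(counts, top_diagonals, window):
    """Keep the best diagonals plus `window` neighbours on either side."""
    allowed = set()
    for diagonal, _ in counts.most_common(top_diagonals):
        allowed.update(range(diagonal - window, diagonal + window + 1))
    return allowed


if __name__ == "__main__":
    counts = ktuple_diagonals("ACGTACGT", "ACGTACGT", k=2)
    print(counts.most_common(3))
    print(sorted(select_diagonals(counts, top_diagonals=1, window=1)))
```

Raising `k` shrinks the number of chance matches (faster, less sensitive); raising `top_diagonals` or `window` widens the region that a later alignment step would have to examine.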
-
-
- SCORING METHOD = PERCENTAGE or ABSOLUTE: This controls whether the similarity
- scores are calculated as raw alignment scores (number of k-tuple matches minus a
- gap penalty for every gap) (ABSOLUTE) or as the alignment score divided by the
- length of the shorter sequence (PERCENTAGE).
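The two scoring methods can be written down directly from the description above. This is a sketch under stated assumptions: the function and argument names are hypothetical, and multiplying by 100 in PERCENTAGE mode is an assumption suggested by the option's name (the text only says the score is divided by the length of the shorter sequence).

```python
def similarity_score(ktuple_matches, gaps, gap_penalty, len_a, len_b,
                     percentage=True):
    """Return the ABSOLUTE or PERCENTAGE similarity score for one pair."""
    # ABSOLUTE: number of k-tuple matches minus a penalty for every gap.
    raw = ktuple_matches - gap_penalty * gaps
    if not percentage:
        return raw
    # PERCENTAGE: raw score relative to the shorter sequence.
    # The factor of 100 is an assumption, not taken from the help text.
    return 100.0 * raw / min(len_a, len_b)


if __name__ == "__main__":
    print(similarity_score(50, 2, 3, 100, 200, percentage=False))
    print(similarity_score(50, 2, 3, 100, 200, percentage=True))
```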
-
-
-
- >>HELP<< 4 Help for multiple alignment parameters
- These parameters control the final multiple alignment. There are 2 gap penalty
- parameters and 1 for whether transitions (A <--> G or C <--> T) are weighted in
- DNA alignments. The default weight matrix for protein alignments is a PAM250
- matrix, converted to distances.
-
- GAP PENALTY (FIXED): This is a penalty for opening up a gap. Decrease it
- and you will encourage gaps of all sizes. TERMINAL GAPS are penalised (same as
- internal ones). BEWARE: if you make this too small (about 5 or so), the program
- will prefer to align each sequence opposite a long gap.
-
- GAP PENALTY (VARYING): This penalty is incurred for every item in a gap. This
- penalises long gaps more. Increase this and gaps will get shorter. BEWARE:
- if you make this too small (about 5 or so), the program will prefer to align
- each sequence opposite a long gap.
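One way to read the two gap penalties together is as a cost of the form fixed + varying x length for each gap. The sketch below assumes exactly that combination; the help text describes the two penalties separately and does not spell out the formula, so treat `gap_cost` as a hypothetical illustration.

```python
def gap_cost(length, fixed_penalty, varying_penalty):
    """Cost of one gap: an opening charge plus a charge per gap position.

    Assumes the two penalties are simply added; this is an interpretation
    of the help text, not a statement about the program's internals.
    """
    if length <= 0:
        return 0
    return fixed_penalty + varying_penalty * length


if __name__ == "__main__":
    # With both penalties at 10, one gap of length 3 costs less than
    # three separate gaps of length 1.
    print(gap_cost(3, 10, 10), 3 * gap_cost(1, 10, 10))
```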
-
- TRANSITIONS = WEIGHTED or UNWEIGHTED: With UNWEIGHTED transitions identical
- bases in a DNA alignment have a DISTANCE of 0; different ones have a distance
- of 10. If transitions are WEIGHTED then A vs G and C vs T will have a distance
- of 5 (less distant than A vs C,T or C vs A,G).
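The DNA distances quoted above (0, 5 and 10) can be captured in a small lookup. The function name is hypothetical, and only the four bases A, C, G and T are handled; how other symbols are scored is not covered by the help text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def base_distance(x, y, weighted=True):
    """Distance between two DNA bases: 0 identical, 5 transition, 10 otherwise.

    With weighted=False every mismatch scores 10, as for UNWEIGHTED transitions.
    """
    x, y = x.upper(), y.upper()
    if x == y:
        return 0
    pair = {x, y}
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    return 5 if (weighted and is_transition) else 10


if __name__ == "__main__":
    for a, b in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C")]:
        print(a, b, base_distance(a, b), base_distance(a, b, weighted=False))
```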
- >>HELP<< 5 Help for output format options.
- Four output formats are offered. You can choose more than one (or all four if
- you wish). NBRF/PIR format is ESPECIALLY USEFUL. Alignments that are written
- in this format can be used again as input (for calculating phylogenetic trees;
- profile alignments; general input).
-
- CLUSTAL format output is a self-explanatory alignment format. It shows the
- sequences aligned in blocks.
-
- GCG output can be used by any of the GCG programs that can work on multiple
- alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
- .msf format files (multiple sequence file); new in version 7 of GCG.
-
- PHYLIP format output can be used for input to the PHYLIP package of Joe
- Felsenstein. This is an extremely widely used package for doing every
- imaginable form of phylogenetic analysis (MUCH more than the modest
- introduction offered by this program).
-
- NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
- characters "-" are used to indicate the positions of gaps in the multiple
- alignment. These files can be re-used as input in any part of clustal that
- allows sequences (or alignments or profiles) to be read in.
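As an illustration of the gapped NBRF/PIR layout, here is a minimal reader. It assumes the usual PIR record shape (a `>P1;name` header line, one description line, then sequence lines ending in `*`); the sample records and the name `read_pir` are made up for this sketch, and real files may carry details it does not handle.

```python
SAMPLE = """\
>P1;seq1
first example sequence
MKT-AYI
AK--QR*
>P1;seq2
second example sequence
MKTWAYIAKRRQR*
"""


def read_pir(text):
    """Parse PIR-style records into {name: sequence}; '-' marks a gap."""
    records = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.startswith(">"):
            i += 1
            continue
        # Header is ">XX;name"; the following line is a free-text description.
        name = line.partition(";")[2].strip()
        i += 2
        residues = []
        while i < len(lines) and not lines[i].startswith(">"):
            chunk = "".join(lines[i].split())
            i += 1
            if chunk.endswith("*"):  # "*" terminates the sequence
                residues.append(chunk[:-1])
                break
            residues.append(chunk)
        records[name] = "".join(residues)
    return records


if __name__ == "__main__":
    for name, seq in read_pir(SAMPLE).items():
        print(name, seq)
```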
- >>HELP<< 6 Help for profile alignments
-
- By PROFILE ALIGNMENT, we mean the alignment of two old alignments. One of the
- alignments can be a single sequence.
-
- The profiles should be in PIR format (one of the 4 output formats produced by
- this program). This is the same as standard NBRF/PIR format, with 1 addition:
- gap characters are indicated by "-".
-
- The alignment method produces a global, optimal alignment using an amino acid
- weight matrix (PAM250 is default) and 2 gap penalty parameters.
-
- Profile alignments allow you to store alignments of your favourite sequences (as
- long as they are in PIR format) and add new sequences to them in small bunches
- at a time. One of the 2 profiles can simply be a single sequence.
-
-
-
- >>HELP<< 7 Help for phylogenetic trees
- Before calculating a tree, you must have an alignment in memory. This can be
- read in from an NBRF/PIR format file, or it can be the result of a full
- multiple alignment that is still in memory.
-
- The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
- you calculate distances (percent divergence) between all pairs of sequences from
- a multiple alignment; second you apply the NJ method to the distance matrix.
-
- EXCLUDE POSITIONS WITH GAPS? If you choose this option, any alignment positions
- where ANY of the sequences have a gap will be ignored. This guarantees that
- the distances will be 'metric'. Also, it means that 'like' will be compared to
- 'like' in all distances. The disadvantage is that you may throw away much of
- the data if there are many gaps.
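The effect of excluding gapped positions can be seen in a small percent-divergence sketch. The names are hypothetical, and one detail is an assumption: when gapped columns are NOT excluded globally, this sketch still skips positions where either of the two sequences being compared has a gap, which the help text does not specify.

```python
def gapped_columns(alignment):
    """Positions where ANY sequence in the alignment has a gap."""
    return {pos for pos, column in enumerate(zip(*alignment)) if "-" in column}


def percent_divergence(a, b, skip=frozenset()):
    """Percentage of compared positions at which a and b differ."""
    compared = differences = 0
    for pos, (x, y) in enumerate(zip(a, b)):
        # Pairwise gaps are always skipped here (an assumption).
        if pos in skip or x == "-" or y == "-":
            continue
        compared += 1
        differences += x != y
    return 100.0 * differences / compared if compared else 0.0


if __name__ == "__main__":
    aln = ["ACGT-A", "ACCTTA", "A-GTTA"]
    print(percent_divergence(aln[0], aln[1]))
    print(percent_divergence(aln[0], aln[1], skip=gapped_columns(aln)))
```

In the second call every pair is measured over the same columns, which is the 'like with like' property mentioned above, at the price of discarding two of the six positions.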
-
- CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this
- option makes little difference. For greater divergence, this option corrects
- for the fact that observed distances underestimate actual evolutionary
- distances. This is because, as sequences diverge, more than one substitution will
- happen at many sites. However, you only see one difference when you look at the
- present day sequences. Therefore, this option has the effect of stretching
- branch lengths in trees (especially long branches). The corrections used here
- (for DNA or proteins) are both due to Motoo Kimura.
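To give a feel for how such a correction stretches long branches, the sketch below uses the empirical protein formula commonly attributed to Kimura, d = -ln(1 - D - D*D/5), where D is the observed fraction of differing sites. Whether the program applies exactly this expression is an assumption; the help text names Kimura but gives no formula, and the DNA correction is not shown here.

```python
import math


def kimura_protein_distance(observed_fraction):
    """Corrected distance from an observed fraction of differing sites.

    Uses d = -ln(1 - D - D^2/5); assumed here for illustration only.
    """
    d = observed_fraction
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


if __name__ == "__main__":
    for observed in (0.05, 0.10, 0.30, 0.50):
        print(observed, round(kimura_protein_distance(observed), 4))
```

At 5% observed divergence the corrected value is almost unchanged, while at 50% it is substantially larger, which matches the behaviour described above.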
-
- To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED
- tree and all branch lengths. The root of the tree can only be inferred by
- using an outgroup (a sequence that you are certain branches at the outside
- of the tree .... certain on biological grounds) OR if you assume a degree
- of constancy in the 'molecular clock', you can place the root along the
- longest branch.
-
- BOOTSTRAPPING is a method for deriving confidence values for the groupings in
- a tree (first adapted for trees by Joe Felsenstein). It involves making N
- random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000);
- drawing N trees (1 from each sample) and counting how many times each grouping
- from the original tree occurs in the sample trees. For a group to be
- considered significant at the 5% level (p <= 0.05) it should occur in at least
- 95% of the sample trees. You must supply a seed number for the random number
- generator.
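The bootstrap procedure can be outlined as below. This is a sketch, not the program's implementation: `groupings_of` is a hypothetical callable that would build a tree from an alignment and return its groupings (for example as a set of frozensets of sequence indices), and sites are resampled with replacement, which is the usual bootstrap convention.

```python
import random


def bootstrap_sample(alignment, rng):
    """Resample alignment columns (sites) with replacement."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, groupings_of, n_samples, seed):
    """Percentage of sample trees containing each grouping of the original tree."""
    rng = random.Random(seed)  # the seed makes a run reproducible
    original = groupings_of(alignment)
    counts = {group: 0 for group in original}
    for _ in range(n_samples):
        sample_groups = groupings_of(bootstrap_sample(alignment, rng))
        for group in original:
            if group in sample_groups:
                counts[group] += 1
    return {group: 100.0 * n / n_samples for group, n in counts.items()}


if __name__ == "__main__":
    aln = ["ACGT", "ACGA", "TCGA"]
    # Stand-in for a real tree builder: always reports the same grouping.
    always_same = lambda alignment: {frozenset({0, 1})}
    print(bootstrap_support(aln, always_same, n_samples=10, seed=42))
```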
- >>HELP<< 8 Help for choosing protein weight matrix
- For protein alignments, you use a weight matrix to determine the similarity of
- non-identical amino acids. For example, Tyr aligned with Phe is usually judged
- to be 'better' than Tyr aligned with Pro.
-
-
-
- There are three 'in-built' weight matrices offered:
-
-
- 1) PAM 100 and 2) PAM 250. These are from the work of M. Dayhoff and are often
- simply called Dayhoff matrices. The PAM 250 matrix is the most commonly used
- and is the default in most protein comparison packages. It is claimed that
- a PAM 100 matrix is more sensitive in many cases, so we have included it
- here.
-
-
- 3) Identity matrix. This matrix just scores identical residues.
-
-
-
-
-
- You can also input your own matrix. If so then be careful: 1) follow the
- instructions on format below; 2) watch the gap penalty parameters (the default
- values may not be appropriate). Conservative substitutions will not be
- indicated in alignments.
-
- The values in a new weight matrix must be integers and the scores should be
- similarities. You can use negative as well as positive values if you wish.
-
-
- INPUT FORMAT: The lower triangle of a 20x20 matrix of values is read in, in free
- format, row by row. The diagonal must be included. Using the 1 letter code,
- the order of amino acids in the matrix is: CSTPAGNDEQHRKMILVFYW. Separate
- the values by spaces (not commas). You can put the values on as many lines
- as you like as long as they are in the right order.
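A reader for this layout might look like the sketch below. It follows the rules stated above (integers, whitespace-separated, lower triangle with diagonal, row by row, in the given residue order); the function name and the choice to return a symmetric dictionary are assumptions made for the example.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"


def parse_weight_matrix(text):
    """Read a lower-triangular 20x20 integer matrix given row by row."""
    values = [int(token) for token in text.split()]  # line breaks are irrelevant
    n = len(ORDER)
    expected = n * (n + 1) // 2  # 210 values including the diagonal
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    matrix = {}
    stream = iter(values)
    for i in range(n):
        for j in range(i + 1):
            score = next(stream)
            matrix[ORDER[i], ORDER[j]] = score
            matrix[ORDER[j], ORDER[i]] = score  # mirror into the upper triangle
    return matrix


if __name__ == "__main__":
    # An identity-style matrix written one row per line.
    rows = [" ".join("1" if j == i else "0" for j in range(i + 1))
            for i in range(len(ORDER))]
    m = parse_weight_matrix("\n".join(rows))
    print(m["C", "C"], m["S", "C"], m["W", "W"])
```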
-
-
- GAP PENALTIES: The default gap penalty parameters work fine with a PAM 250
- matrix. The range of PAM 250 values is 0 to 25 (when rescaled to be positive)
- and the default gap penalties are 10 each. Very approximately, the best gap
- penalty settings are 2/5 the maximum weight matrix score.
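The 2/5 rule of thumb is easy to check against the figures quoted above: a maximum rescaled score of 25 gives 2/5 x 25 = 10, the default penalty. The helper below applies the same rule to any set of matrix values; the function name is hypothetical, and rescaling by subtracting the minimum value is an assumption about what "rescaled to be positive" means.

```python
def suggested_gap_penalty(scores):
    """Rough starting gap penalty: 2/5 of the maximum rescaled matrix score."""
    scores = list(scores)
    # Shift so the smallest value becomes 0 (assumed rescaling).
    max_rescaled = max(scores) - min(scores)
    return round(2 * max_rescaled / 5)


if __name__ == "__main__":
    # Made-up values spanning 25 units, matching the 0 to 25 range quoted above.
    print(suggested_gap_penalty(range(-8, 18)))
```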
- >>HELP<< 9 Help for command line parameters
- DATA (sequences)
-
- /INFILE=file.ext :input sequences.
- /PROFILE1=file.ext and /PROFILE2=file.ext :profiles (old alignment).
-
- VERBS (do things)
-
- /HELP or /CHECK :list the command line params.
- /ALIGN :do full multiple alignment
- /TREE :calculate NJ tree.
- /BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-
- PARAMETERS (set things)
-
- ***Pairwise alignments:***
- /KTUP=n :word size /TOPDIAGS=n :number of best diags.
- /WINDOW=n :window around best diags. /PAIRGAP=n :gap penalty
-
- ***Multiple alignments:***
- /FIXEDGAP=n :fixed length gap pen. /FLOATGAP=n :variable length gap pen.
- /MATRIX= :PAM100 or ID or file name. /TYPE=p or d :type is prot. or DNA
- /OUTPUT= :GCG or PHYLIP or PIR. /TRANSIT :transitions not weighted.
-
- ***Trees:*** /SEED :seed number for bootstraps.
- /KIMURA :use Kimura's correction. /TOSSGAPS :ignore positions with gaps.
-
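As a usage illustration, the helper below assembles a command line from the qualifiers listed above. Several things here are assumptions: the executable name `clustalv`, the file name `globins.pir`, and the idea that a verb and several parameters can be combined freely in one invocation. Only the qualifier names themselves come from the list above.

```python
def build_command(executable, infile, verbs=(), **params):
    """Join an input file, verbs and parameters into one command string."""
    parts = [executable, f"/INFILE={infile}"]
    parts += [f"/{verb.upper()}" for verb in verbs]
    for name, value in params.items():
        # True means a bare switch such as /KIMURA; anything else is /NAME=value.
        if value is True:
            parts.append(f"/{name.upper()}")
        else:
            parts.append(f"/{name.upper()}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical executable and file names.
    print(build_command("clustalv", "globins.pir", verbs=["align"],
                        output="PIR", ktup=2))
    print(build_command("clustalv", "globins.pir", verbs=["tree"],
                        kimura=True, tossgaps=True))
```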